Skip to content

Conversation

@yaooqinn
Copy link
Contributor

@yaooqinn yaooqinn commented Feb 3, 2026

Summary

Adds Spark-specific transform function that supports the 2-arg lambda signature transform(array, (x, i) -> expr) with element + index.

Background

Spark's transform function supports two signatures:

  1. transform(array, x -> expr) - element only
  2. transform(array, (x, i) -> expr) - element + index

Velox's Presto implementation only supports signature #1. This PR adds a Spark-specific implementation that supports both.

Changes

  • Created velox/functions/sparksql/Transform.cpp with SparkTransformFunction
  • Uses runtime arity detection to provide index vector when needed
  • Uses 32-bit integer for index (matching Spark's IntegerType)
  • Added 6 test cases covering both signatures and null handling

Test Plan

-- Spark behavior this enables:
spark-sql> SELECT transform(array(1, 2, 3), (x, i) -> x + i);
[1, 3, 5]

All 6 new tests pass:

  • elementOnly - basic transform without index
  • elementWithIndex - transform using element + index
  • withNulls - null handling in arrays
  • nestedArrays - nested array transformation
  • indexOnly - using only index (ignore element)
  • emptyArray - empty array edge case

@netlify
Copy link

netlify bot commented Feb 3, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit c28d2c1
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6981ce81070e0e0008342e56

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 3, 2026
@yaooqinn yaooqinn changed the title Add Spark-specific transform with index parameter support feat(sparksql): Add transform with index parameter support Feb 3, 2026
@yaooqinn yaooqinn force-pushed the spark-transform-with-index branch from 878130c to 7254aa4 Compare February 3, 2026 04:34
Spark's transform function supports two signatures:
1. transform(array, x -> expr) - element only
2. transform(array, (x, i) -> expr) - element + index

Velox's Presto implementation only supports signature #1. This adds
a Spark-specific implementation that supports both signatures.

Key changes:
- Created velox/functions/sparksql/Transform.cpp with SparkTransformFunction
  that detects lambda arity at runtime and provides index vector when needed
- Uses 32-bit integer for index (matching Spark's IntegerType)
- Added comprehensive tests for both signatures including null handling

Tested with 6 new test cases, all passing.
@yaooqinn yaooqinn force-pushed the spark-transform-with-index branch from 7254aa4 to 0b50680 Compare February 3, 2026 06:31
Copy link
Collaborator

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feature. Please also update the document.

}
};

// Test transform with element-only lambda: transform(array, x -> x * 2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

End with .

namespace facebook::velox::functions::sparksql {
namespace {

/// Spark's transform function supporting both signatures:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supporting -> supports

///
/// See Spark documentation:
/// https://spark.apache.org/docs/latest/api/sql/index.html#transform
class SparkTransformFunction : public exec::VectorFunction {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove Spark prefix since it has been in sparksql namespace

/// https://spark.apache.org/docs/latest/api/sql/index.html#transform
class SparkTransformFunction : public exec::VectorFunction {
public:
void apply(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we reuse some code of udf_transform, move some of them to common path velox/functions/lib?

- Rename SparkTransformFunction to TransformFunction (in sparksql namespace)
- Fix grammar: 'supporting' -> 'supports'
- Add period at end of test comment
@yaooqinn
Copy link
Contributor Author

yaooqinn commented Feb 3, 2026

I analyzed the code reuse opportunity. The common utilities are already shared in velox/functions/lib/:

  • flattenArray() from LambdaFunctionUtil.h
  • toElementRows(), getElementToTopLevelRows() from RowsTranslationUtil.h
  • toWrapCapture(), addNullsForUnselectedRows() from LambdaFunctionUtil.h

The Spark-specific additions are:

  1. Index detection via functionType->size() == 3
  2. createIndexVector() helper for the (x, i) signature

Moving the remaining lambda iteration loop to lib would add complexity without significant benefit since it's ~15 lines of straightforward code. The current structure follows the same pattern as other Spark-specific lambda functions like filter.

If you'd prefer a different approach (e.g., a templated base class), I'm happy to refactor.

std::vector<VectorPtr> lambdaArgs = {flatArray->elements()};

// If lambda expects index, create index vector.
if (withIndex) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the only change with presto TransformFunction is here, am I right? So I would prefer a base TransformFunction and add a virtual addIndexVectors(lambdaArgs) with default empty implementation,

context.moveOrCopyResult(localResult, rows, result);
}

static std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the signatures, call the super function signatures() and add the spark specified function signature

Address reviewer feedback to share code between Presto and Spark transform:
- Create TransformFunctionBase in velox/functions/lib/ with virtual addIndexVector()
- Presto TransformFunction inherits base directly (no override needed)
- Spark TransformFunction overrides addIndexVector() to add index parameter
- Spark signatures() calls base and adds the (T, integer, U) signature

All 13 transform tests pass (6 Spark + 7 Presto).
@yaooqinn yaooqinn force-pushed the spark-transform-with-index branch from cc7afe8 to c28d2c1 Compare February 3, 2026 10:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants